Our objective is video retrieval based on natural language queries. In addition, we consider the analogous problems of retrieving sentences or generating descriptions given an input video. Recent work has addressed these problems by embedding visual and textual inputs into a common space where semantic similarities correlate with distances. We also adopt the embedding approach and make the following contributions: First, we utilize web image search in the sentence embedding process to disambiguate fine-grained visual concepts. Second, we propose embedding models for sentence, image, and video inputs whose parameters are learned simultaneously. Finally, we show how the proposed model can be applied to description generation. Overall, we observe a clear improvement over state-of-the-art methods in the video and sentence retrieval tasks. In description generation, the performance level is comparable to the current state of the art, even though our embeddings were trained for the retrieval tasks.